Optimal Transport-Guided Conditional Score-Based Diffusion Model
Xiang Gu, Liwei Yang
Conditional score-based diffusion models (SBDMs) generate target data conditioned on paired data, and have achieved great success in image translation. However, they require paired data as the condition, and real-world applications often cannot supply enough paired data.
KOALA: Empirical Lessons Toward Memory-Efficient and Fast Diffusion Models for Text-to-Image Synthesis
As text-to-image (T2I) synthesis models grow in size, their inference costs rise because they need more expensive GPUs with larger memory. Together with restricted access to training datasets, this makes the models challenging to reproduce. Our study aims to reduce these inference costs and explores how far the generative capabilities of T2I models can be extended using only publicly available datasets and open-source models. To this end, using the de facto standard text-to-image model, Stable Diffusion XL (SDXL), we present three key practices in building an efficient T2I model: (1) Knowledge distillation: we explore how to effectively distill the generation capability of SDXL into an efficient U-Net and find that self-attention is the most crucial part.
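The knowledge-distillation idea above can be illustrated with a generic sketch. It assumes a common formulation in which the student is trained on its own task loss plus mean-squared-error terms that pull its output and its intermediate features toward the teacher's. The function names, the weights `w_out` and `w_feat`, and the use of flat lists in place of tensors are all hypothetical; this is not the exact objective used in KOALA.

```python
def mse(a, b):
    """Mean squared error between two equal-length lists of floats."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(task_loss, student_out, teacher_out,
                      student_feats, teacher_feats,
                      w_out=1.0, w_feat=1.0):
    """Generic distillation objective (illustrative assumption).

    task_loss      -- the student's ordinary training loss (a float)
    student_out    -- student prediction, flattened to a list
    teacher_out    -- teacher prediction for the same input
    student_feats  -- list of flattened intermediate features (student)
    teacher_feats  -- matching list of teacher features
    """
    # Output-level term: match the teacher's final prediction.
    out_term = mse(student_out, teacher_out)
    # Feature-level term: match each chosen intermediate block,
    # for example the self-attention features the abstract highlights.
    feat_term = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    return task_loss + w_out * out_term + w_feat * feat_term


loss = distillation_loss(
    task_loss=0.5,
    student_out=[1.0, 2.0], teacher_out=[1.0, 4.0],
    student_feats=[[0.0, 0.0]], teacher_feats=[[1.0, 1.0]],
)
print(loss)
```

Which teacher features are matched is the design choice that matters in such a setup; the abstract reports that self-attention features were the most important ones to distill.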
Long-Short Transformer: Efficient Transformers for Language and Vision
Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexity with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters of the previous method, while being faster and able to handle sequences 3x as long as its full-attention version on the same hardware. On ImageNet, it obtains state-of-the-art results (e.g., a moderate-size 55.8M-parameter model trained solely on 224x224 ImageNet-1K reaches 84.1% Top-1 accuracy), while being more scalable on high-resolution images.
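The quadratic-versus-linear claim can be made concrete with a rough count of attention scores per head. The sketch below assumes each query attends to a local window of neighbours on either side plus a fixed number of projected long-range summaries; the parameter names `window` and `rank` and the example values are illustrative assumptions, not the paper's settings.

```python
def full_attention_scores(n):
    """Scores computed by full self-attention: every query sees every key."""
    return n * n


def long_short_attention_scores(n, window, rank):
    """Approximate scores for a combined local + projected attention.

    window -- neighbours on each side covered by short-term attention
    rank   -- number of projected long-range key/value summaries
    Both names are hypothetical and used only for this illustration.
    """
    # A query cannot see more local keys than the sequence contains.
    local = min(n, 2 * window + 1)
    return n * (local + rank)


for n in (1024, 2048, 4096):
    full = full_attention_scores(n)
    ls = long_short_attention_scores(n, window=8, rank=16)
    print(n, full, ls)
```

With `window` and `rank` held fixed, doubling the sequence length doubles the long-short count but quadruples the full-attention count, which is why the linear variant scales better to long documents and high-resolution images.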